ABSTRACT

Graphs are fundamental data structures and have been em- 
ployed for centuries to model real-world systems and phe- 
nomena. Random walk with restart (RWR) provides a good 
proximity score between two nodes in a graph, and it has 
been successfully used in many applications such as auto- 
matic image captioning, recommender systems, and link pre- 
diction. The goal of this work is to find nodes that have top- 
k highest proximities for a given node. Previous approaches 
to this problem find nodes efficiently at the expense of exact- 
ness. The main motivation of this paper is to answer, in the 
affirmative, the question, 'Is it possible to improve the search 
time without sacrificing the exactness?'. Our solution, K- 
dash, is based on two ideas: (1) It computes the proximity 
of a selected node efficiently by sparse matrices, and (2) It 
skips unnecessary proximity computations when searching 
for the top-k nodes. Theoretical analyses show that K-dash 
guarantees result exactness. We perform comprehensive ex- 
periments to verify the efficiency of K-dash. The results 
show that K-dash can find top-k nodes significantly faster 
than the previous approaches while it guarantees exactness. 

1. INTRODUCTION

Recent advances in social and information science have 
shown that linked data pervade our society and the natural
world around us [24]. Graphs have become increasingly important
for representing complicated structures and schema-less data
such as those generated by Wikipedia^1, Freebase^2, and
various social networks [10]. Due to the extensive applications
of graph models, vast amounts of graph data have been
collected and graph databases have attracted significant at- 
tention in the database community. In recent years, various 
approaches have been proposed to deal with graph-related 
research problems such as subgraph search [24], shortest-path
query [4], pattern match [5], and graph clustering [25]
to get insights into graph structures. 

1 http://www.wikipedia.org/
2 
With the rapidly increasing amounts of graph data, searching
graph databases to identify high proximity nodes, where
a proximity measure is used to rank nodes in accordance 
with relevance to a query node [13], has become an impor- 
tant research problem. Many papers in the database com- 
munity have addressed node-to-node proximities [20, 1, 13]. 
For example, Sun et al. proposed a novel proximity mea- 
sure called PathSim which produces good similarity qual- 
ities given heterogeneous information networks [20]. Sim- 
rank++, proposed by Antonellis et al., finds high proximity 
nodes effectively for historical click data [1]. One of the 
most successful techniques known to the academic commu- 
nities is based on random walk with restart (RWR) [19]. This 
is because the proximity defined by RWR yields the follow- 
ing benefits: (1) it captures the global structure of the graph 
[8], and (2) it captures multi-facet relationships between two
nodes unlike traditional graph distances [21]. 

However, the computation of the proximities by RWR is 
computationally expensive. Consider a random particle that
starts from query node q. The particle iteratively moves to
one of its neighbors with a probability proportional to the
weight of the connecting edge. Additionally, in each step there
is a probability that it will restart at node q. The probability
of finding the particle at each node changes over the iterations
as this procedure is applied recursively, and it eventually
converges to a steady-state probability. The proximity of node u with
respect to node q is defined as the steady-state probability 
with which the particle stays at node u. 

Although RWR has been receiving increasing interest
from many applications [15, 11, 12, 19], its excessive CPU
time has led to the introduction of approximate approaches [22,
19]. These approaches gain speed at the
expense of exactness. However, approximate algorithms are
not widely adopted, because it is difficult for them
to enhance the quality of real applications.
Therefore, we address the following problem in this paper: 

Problem (Top-k search for RWR). 

Given: The query node q, and the required number of 
answer nodes K.

Find: Top K nodes with the highest proximities with 
respect to node q exactly. 

To the best of our knowledge, our approach to finding 
top-k nodes in RWR is the first solution to achieve both 
exactness and efficiency at the same time. 

1.1 Contributions 

We propose a novel method called K-dash that can effi- 
ciently find top-k nodes in RWR. In order to reduce search 
cost, (1) we use sparse matrices to compute the exact proximity
of a selected node, and (2) we prune low proximity
nodes by estimating the proximities of those nodes without 
computing their exact proximities. K-dash has the following 
attractive characteristics based on the above ideas: 

• Exact: K-dash does not sacrifice accuracy even though 
it exploits an estimation-based approach to prune un- 
likely nodes; it returns top-k nodes without error un- 
like the previous approximate approaches. 

• Efficient: K-dash practically requires O(n + m) time
where n and m are the number of nodes and edges,
respectively. By comparison, solutions based on existing
approximate algorithms are expensive; they need
O(n^2) time to find the answer nodes. Note that m ≪ n^2
in practice [14].

• Nimble: K-dash practically needs O(n + m) space
while the previous approaches need O(n^2) space. The
required memory space of K-dash is smaller than that
of the previous approximate approaches.

• Parameter-free: The previous approaches require 
careful setting of the inner-parameter [22], since it 
impacts the search results. K-dash, however, is com- 
pletely automatic; this means it does not require the 
user to set any inner-parameters. 

While RWR has been used in many applications, it has 
been difficult to utilize it due to its high computational cost. 
However, by providing exact solutions in a highly efficient 
manner, K-dash will allow many more RWR-based applica- 
tions to be developed in the future. 

The remainder of this paper is organized as follows. Sec- 
tion 2 describes related work. Section 3 overviews the back- 
ground of this work. Section 4 introduces the main ideas 
of K-dash. Section 5 gives theoretical analyses of K-dash. 
Section 6 reviews the results of our experiments. Section 7 
provides our brief conclusion. 

2. RELATED WORK 

Node-to-node proximity is an important property. One of 
the most popular proximity measurements is RWR, and re- 
searchers of data engineering have published many papers on 
RWR and its applications [15, 11, 12, 19, 22]. With our ap- 
proach, many applications can be processed more efficiently. 

Application. Automatic image captioning is a technique 
which automatically assigns caption words to a query im- 
age. Pan et al. proposed a graph-based automatic caption 
method in which images and caption words are treated as 
nodes in a mixed media graph [15]. They utilized RWR to 
estimate the correlations between the query image and the 
captions. They reported that their method provided 10 per- 
cent higher captioning accuracy than a fine-tuned method. 

Recommendation systems aim to provide personalized rec- 
ommendations of items to a user. One recent recommenda- 
tion technique proposed by Konstas et al. is based on RWR 
over a graph that connects users to tags and tags to items, 
where the probabilities of relevance for items are given by 
RWR proximities; high interest items would have high proximities.
They incorporate additional information such
as friendship and social tagging embedded in social knowledge
to improve the accuracy of item recommendations [11].
They also applied a standard collaborative filtering method 
as a baseline, and showed that their method was superior. 

The question of 'which new interactions among social net- 
work members are more likely to occur in the near future?' 
is being avidly pursued by many researchers. Schifanella 
et al. proposed a metadata based approach for this prob- 
lem [17]. Their idea is that members with similar interests 
are more likely to be friends, so semantic similarity measures 
among members based on their annotation metadata should 
be predictive of social links. Liben-Nowell et al. explored
this question by using RWR [12]; the probability of a fu- 
ture collaboration between authors is computed from RWR 
proximity. Their approach is based on the observation that 
the topology of the network can suggest many new collabo- 
rations. For example, two researchers who are close in the 
network will have many colleagues in common, and thus are 
more likely to collaborate in the near future. They took the 
RWR-based approach since it can capture the global struc- 
ture of the graph. They showed that RWR provides better 
link predictions than the random prediction approach. 

Approximation method. Even though RWR is very useful, 
one problem is its large CPU time. Sun et al. observed that 
the distribution of RWR proximities is highly skewed. Based 
on this observation, combined with the fact that many real
graphs have block-wise/partition structure, they proposed
an approximation approach for RWR [19]; they performed 
RWR only on the partition that contains the query node. 
All nodes outside the partition are simply assigned RWR 
proximities of 0. In other words, their approach outputs a 
local estimation of RWR proximities. 

Tong et al. proposed a fast approximation solution for
RWR. They designed B_LIN and its derivative, NB_LIN
[22]. These methods take advantage of the block-wise structure
and linear correlations in the adjacency matrix of real
graphs, using the Sherman-Morrison Lemma [16] and the
singular value decomposition (SVD). For NB_LIN in particular,
they proved an error bound. The experimental
results showed that their methods outperformed the approximation
method of Sun et al. [19]. Their methods require
O(n^2) space and O(n^2) time. This is because their methods
utilize O(n^2) size matrices to approximate the adjacency
matrix for proximity computation.

3. PRELIMINARY 

In this section, we formally define the notations and in- 
troduce the background of this paper. Table 1 lists the main 
symbols and their definitions. 

Measuring the proximity of two nodes in a graph can be 
achieved using RWR. Starting from a node, a RWR is per- 
formed by iteratively following an edge to another node at 
each step. Additionally, at every step, there is a non-zero 
probability, c, of returning to the start node. Let p be a column
vector where p_u denotes the probability that the random
walk is at node u. q is a column vector of zeros with
the element corresponding to the starting node q set to 1,
i.e. q_q = 1. Also let A be the column normalized adjacency
matrix of the graph. In other words, A is the transition
probability table where its element A_uv gives the probability
of node u being the next state given that the current
state is node v. The steady-state, or stationary, probabilities
for each node can be obtained by recursively applying
the following equation until convergence:

p = (1 - c) A p + c q    (1)

where the convergence of the equation is guaranteed [18].
The steady-state probabilities give us the long term visit rate
of each node given a bias toward query node q. Therefore,
p_u can be considered as a measure of proximity of node u
with respect to node q.

Table 1: Definition of main symbols.

Symbol   Definition
q        Query node
K        Number of answer nodes
n        Number of nodes
m        Number of edges
c        The restart probability
p        n x 1 vector, p_u is the proximity of node u
q        n x 1 vector, the q-th element is 1 and all others are 0
A        The column normalized adjacency matrix

This method needs O(mt) time where t is the number of
iteration steps. This incurs excessive CPU time for large
graphs, and a fast solution is demanded as illustrated by
the statement 'its on-line response time is not acceptable in
real life situations' made in a previous study [11]. It should
be emphasized that shortening response time is critical to
enhancing business success in real web applications^3.
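
The iterative method can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this paper: the function names, the edge-list input format, the tolerance, and the assumption that every node has at least one outgoing edge are ours. Each pass touches every non-zero element of A once, which is where the O(mt) cost comes from.

```python
def column_normalize(n, edges):
    """Build the column normalized matrix A in sparse form.

    edges: (source, target, weight) triples for directed edges.
    Returns cols, where cols[v] lists (u, A_uv) for the edges v -> u.
    Assumes every node has at least one outgoing edge.
    """
    cols = [[] for _ in range(n)]
    for v, u, w in edges:
        cols[v].append((u, w))
    for v in range(n):
        total = sum(w for _, w in cols[v])
        cols[v] = [(u, w / total) for u, w in cols[v]]
    return cols


def rwr_power_iteration(cols, q, c, tol=1e-12, max_iter=1000):
    """Apply p = (1 - c) A p + c q repeatedly until p stops changing."""
    n = len(cols)
    p = [0.0] * n
    p[q] = 1.0
    for _ in range(max_iter):
        new_p = [0.0] * n
        for v in range(n):                  # one pass over all m non-zeros
            for u, a_uv in cols[v]:
                new_p[u] += (1.0 - c) * a_uv * p[v]
        new_p[q] += c                       # restart at the query node
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Undirected path 0 - 1 - 2, written as two directed edges per link.
edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0)]
cols = column_normalize(3, edges)
p = rwr_power_iteration(cols, q=0, c=0.95)
```

In this small example the proximities sum to 1 and decrease with the number of hops from the query node.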

4. PROPOSED METHOD 

In this section, we explain the two main ideas underlying 
K-dash. The main advantage of our approach is to exactly 
and efficiently find top-k highest proximity nodes for RWR. 
First, we give an overview of each idea and then a full de- 
scription. Proofs of lemmas or theorems in this section are 
shown in Appendix A. 

4.1 Ideas behind K-dash 

Our solution is based on the following two approaches: 

Sparse matrices computation. The proximities for a query 
node are the steady-state probabilities which are computed 
by recursive procedures as described in Section 3. This ap- 
proach requires high computation time because it computes 
the proximities of all the nodes in the graph. Our idea is 
simple; we compute the proximities of only selected nodes 
enough to find the top-k nodes, instead of computing the 
proximities of all nodes. 

As described in Section 4.2.1, the proximities of selected
nodes can be computed naively from the inverse matrix that
is directly obtained from Equation (1). Therefore, if we
precompute and store this inverse matrix, we can get the
proximities efficiently. However, this approach is impractical
when the dataset is large, because it requires quadratic space
to hold the inverse matrix.

We introduce an efficient approach that can compute the
proximities from sparse matrices. In the precomputing process,
we reorder nodes and compute the inverse matrices
of the lower/upper triangulars obtained by LU decomposition
as described in Section 4.2.2. A lower/upper triangular
matrix is a matrix where all the elements above/below the
main diagonal are zero. As a result, the inverse matrices
are sparse, and we can compute the proximities of the selected
nodes with low memory consumption by using the
adjacency-list representation [6].

3 http://www.keynote.com/downloads/Zona_Need_For_Speed.pdf

This new idea has the following two major advantages besides
the one described above. First, we can compute the
proximities exactly. This is because LU decomposition, unlike
SVD which is used in the previous methods [22], is not
an approximation method. The second advantage is that we
can compute the proximities efficiently. This is because we
use sparse matrices to compute the proximities.

Tree estimation. Although our sparse matrices approach 
is able to compute the proximities of selected nodes, we 
have the following two questions to find the top-k nodes: (1) 
'What nodes should be selected to compute the proximities 
in the search process?', and (2) 'Can we avoid computing 
the proximities of unselected nodes?'. The second approach 
is designed to answer these two questions. 

These questions can be answered by estimating what nodes 
can be expected to have high/low proximities. Our pro- 
posal exploits the following observations: the proximity of 
a node declines as the number of hops from the query node 
increases, and proximities of unselected nodes can be esti- 
mated from computed proximities. Our search algorithm 
first constructs a single breadth-first search tree rooted at 
the query node. We compute the proximities of the top-k 
nearest nodes from the root node to discover answer can- 
didate nodes. We then estimate the proximities of unse- 
lected nodes from the proximities of already selected nodes 
to obtain the upper proximity bound. The time incurred 
to estimate node proximity is O(1) for each node. In the
search process, if the upper proximity bound of a node gives
a score lower than the K-th highest proximity of the candidate
nodes, the node cannot be one of the top-k highest
proximity nodes. Accordingly, unnecessary proximity computations
can be skipped.

This estimation allows us to find the top-k nodes exactly 
while we prune unselected nodes. This means we can safely 
discard unlikely nodes at low CPU cost. This estimation ap- 
proach also allows us to automatically determine the nodes 
for which we compute the proximities. This implies that our
approach requires no user-defined inner-parameters.

4.2 Sparse matrices computation 

Our first approach is to obtain sparse inverse matrices to 
compute the proximities of selected nodes efficiently. In this 
section, we first describe how to compute the proximities 
by inverse matrices. We then show that obtaining the
sparse inverse matrices is an NP-complete problem, and we
then present our approximate approach for the problem.

4.2.1 Proximity computation 
From Equation (1), we can obtain the following equation:

p = c {I - (1 - c) A}^{-1} q = c W^{-1} q    (2)

where I represents the identity matrix and W = I - (1 - c)A.
This equation implies that we can compute the proximities
of selected nodes by obtaining the corresponding elements
in the inverse matrix W^{-1}. However, this approach requires
high memory consumption. This is because the inverse matrix
W^{-1} would be dense even though the matrix W itself
is sparse [16] (in many real graphs, the number of edges is
much smaller than the squared number of nodes [14]). That
is, this approach requires O(n^2) space.

We utilize the inverse matrices of lower/upper triangulars
to compute the proximities in our approach. Formally,
the following equation gives the proximities for the query
node, where the matrix W is decomposed to LU by the LU
decomposition (i.e. W = LU).

p = c U^{-1} L^{-1} q    (3)

Note that the matrices L^{-1} and U^{-1} are lower and upper
triangular, respectively.

4.2.2 Inverse matrices problem 

As shown in Equation (3), if we precompute the matrices
L^{-1} and U^{-1}, we can compute proximities of selected
nodes. However, this raises the following question: 'Can the
matrices L^{-1} and U^{-1} be sparse if matrix W^{-1} is dense?'.
Our answer to this question is to compute the sparse matrices
L^{-1} and U^{-1} by reordering the columns and rows of
the sparse matrix A. But finding the node order in matrix
A that yields the sparse matrices is NP-complete.

Theorem 1 (Inverse matrices problem).
Determining the node order that minimizes non-zero elements
in matrices L^{-1} and U^{-1} is NP-complete.

Because the inverse matrices problem is NP-complete,
we introduce an approximation to address this problem. Before
we describe our approaches in detail, we show that the matrix
elements of L^{-1} and U^{-1} can be represented by those of
L and U by forward/backward substitution [16] as follows:

L^{-1}_{ij} = \begin{cases} 0 & (i < j) \\ 1/L_{ii} & (i = j) \\ -\frac{1}{L_{ii}} \sum_{k=j}^{i-1} L_{ik} L^{-1}_{kj} & (i > j) \end{cases}    (4)

U^{-1}_{ij} = \begin{cases} 0 & (i > j) \\ 1/U_{ii} & (i = j) \\ -\frac{1}{U_{ii}} \sum_{k=i+1}^{j} U_{ik} U^{-1}_{kj} & (i < j) \end{cases}    (5)

where the matrix elements of L and U can be represented
by those of W by Crout's algorithm [16] as follows:

L_{ij} = \begin{cases} 0 & (i < j) \\ 1 & (i = j) \\ \frac{1}{U_{jj}} \left( W_{ij} - \sum_{k=1}^{j-1} L_{ik} U_{kj} \right) & (i > j) \end{cases}    (6)

U_{ij} = \begin{cases} 0 & (i > j) \\ W_{ij} & (i \le j \cap i = 1) \\ W_{ij} - \sum_{k=1}^{i-1} L_{ik} U_{kj} & (i \le j \cap i \ne 1) \end{cases}    (7)

Equations (4), (5), (6), and (7) imply that elements of L^{-1},
U^{-1}, L, and U are computed from the columns from left to
right, and within each column from top to bottom. For example,
element L^{-1}_{ij} can be computed from the corresponding
upper/left elements of L and L^{-1}, and element L_{ij} can
be computed from the corresponding upper/left elements of
W, L, and U.
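
As an illustration of Equations (3) to (7), the sketch below decomposes W, inverts the two triangular factors, and reads off the proximity of a single node from one row of U^{-1} and one column of L^{-1}. It is only a sketch under our own assumptions: the function names are hypothetical, matrices are dense Python lists for readability (the method described here stores them as sparse adjacency lists), no pivoting is performed, and the column of U^{-1} is filled from the diagonal upward because Equation (5) refers to entries below (i, j).

```python
def lu_decompose(W):
    """Equations (6) and (7): W = L U with a unit diagonal on L."""
    n = len(W)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):                      # columns, left to right
        for i in range(n):                  # rows, top to bottom
            if i <= j:
                U[i][j] = W[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
            if i == j:
                L[i][j] = 1.0
            elif i > j:
                s = sum(L[i][k] * U[k][j] for k in range(j))
                L[i][j] = (W[i][j] - s) / U[j][j]
    return L, U


def invert_lower(L):
    """Equation (4): inverse of a lower triangular matrix."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):               # from the diagonal downward
            if i == j:
                X[i][j] = 1.0 / L[i][i]
            else:
                s = sum(L[i][k] * X[k][j] for k in range(j, i))
                X[i][j] = -s / L[i][i]
    return X


def invert_upper(U):
    """Equation (5): inverse of an upper triangular matrix."""
    n = len(U)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, -1, -1):          # from the diagonal upward
            if i == j:
                X[i][j] = 1.0 / U[i][i]
            else:
                s = sum(U[i][k] * X[k][j] for k in range(i + 1, j + 1))
                X[i][j] = -s / U[i][i]
    return X


def proximity(u, q, c, L_inv, U_inv):
    """Equation (3) for one node: c * (row u of U^-1) . (column q of L^-1)."""
    n = len(L_inv)
    # Row u of U^-1 is zero left of column u; column q of L^-1 is zero above row q.
    return c * sum(U_inv[u][k] * L_inv[k][q] for k in range(max(u, q), n))


# Column normalized matrix of the undirected path 0 - 1 - 2.
c = 0.95
A = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
n = len(A)
W = [[(1.0 if i == j else 0.0) - (1.0 - c) * A[i][j] for j in range(n)]
     for i in range(n)]
L, U = lu_decompose(W)
L_inv, U_inv = invert_lower(L), invert_upper(U)
p = [proximity(u, 0, c, L_inv, U_inv) for u in range(n)]
```

The resulting vector p satisfies Equation (1) for query node 0 up to floating-point error, without any iteration at query time.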

Our approaches are based on the following three observations
in the above four equations: (1) elements L^{-1}_{ij} and U^{-1}_{ij}
would be zero if the corresponding upper/left elements of L
and U are zero, (2) upper/left elements of L and U would
be zero if the corresponding upper/left elements of W are
zero, and (3) the upper/left elements of W would be zero
if the corresponding upper/left elements of A are zero since
A = (I - W)/(1 - c). That is, we can effectively compute
the sparse inverse matrices by letting upper/left elements of
matrix A be zero.

Figure 1: Reordering methods.

We introduce the following three approximation solutions 
against the inverse matrices problem: 

Degree reordering. In this approach, we arrange the nodes of
the given graph in ascending order of degree (the number
of edges incident to a node) and rename them in that order.
Low degree nodes have few edges, so the upper/left elements
of the corresponding matrix A are expected to be 0 with
this approach.

Cluster reordering. This approach first divides the given
graph into k partitions by the Louvain Method [3], and it arranges
nodes according to the partitions. Note that the
number of partitions, k, is automatically determined by the Louvain
Method. It then creates a new, empty (k+1)-th partition.
Finally, if a node of a partition has an edge to another partition,
it moves the node to the (k+1)-th partition. As a
result, matrix A would be a doubly-bordered block diagonal
matrix [16] as shown in Figure 1-(2); elements corresponding
to cross-partition edges would be 0 for the k partitions^4.

We use the Louvain Method because it is an efficient approach^5
and it utilizes the modularity [3] as the quality
measure for partitioning. The modularity assesses the fitness
of node partitioning, in the sense that there are many
edges within a partition and only a few between partitions. That
is, the Louvain Method reduces the number of cross-partition
edges. As a result, this approach should yield sparser
inverse matrices.

Hybrid reordering. This approach is a combination of the 
degree and the cluster reordering. That is, we arrange nodes 
by cluster reordering, and we then sort nodes in each parti- 
tion by their degree. This approach makes matrix A have 
no cross-partition edges for k partitions, and the upper/left 
elements of each partition are expected to be 0. 

Figure 1 illustrates matrix A obtained by each of the above
approximation approaches for the inverse matrices problem,
where zero and non-zero elements are shown in white and
gray, respectively. Algorithms 1, 2, and 3 in Appendix B
show the details of degree reordering, cluster reordering, and
hybrid reordering, respectively.
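
The reordering heuristics can be sketched as follows. This is not a transcription of Algorithms 1 to 3: the function names and the undirected edge-list input are our assumptions, and the partition labels are taken as given (the Louvain Method itself is not implemented here). Ties in degree are broken by the original node id.

```python
def degrees(n, edges):
    """Degree of every node for a list of undirected (u, v) edges."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_order(n, edges):
    """Degree reordering: old node ids in ascending order of degree."""
    deg = degrees(n, edges)
    return sorted(range(n), key=lambda u: (deg[u], u))


def hybrid_order(n, edges, partition, k):
    """Hybrid reordering.

    partition[u] in 0..k-1 is assumed to come from a clustering step.
    Nodes with an edge into another partition move to the border
    partition k; nodes are then sorted by degree inside each partition.
    """
    deg = degrees(n, edges)
    block = list(partition)
    for u, v in edges:
        if partition[u] != partition[v]:
            block[u] = block[v] = k
    return sorted(range(n), key=lambda u: (block[u], deg[u], u))


def relabel(edges, order):
    """Rename nodes so that order[i] becomes node i."""
    new_id = {old: new for new, old in enumerate(order)}
    return [(new_id[u], new_id[v]) for u, v in edges]


# Star graph: the hub (node 0) is pushed to the last position.
star = [(0, 1), (0, 2), (0, 3)]
star_order = degree_order(4, star)

# Two triangles joined by the edge (2, 3); nodes 2 and 3 form the border.
two_triangles = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
hybrid = hybrid_order(6, two_triangles, partition=[0, 0, 0, 1, 1, 1], k=2)
```

After relabeling with either order, the edges of high degree or border nodes fall in the last rows and columns of A, which leaves the upper/left part of the matrix empty.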

Owing to these three approaches, we can effectively obtain
sparse matrices L and U, and then sparse matrices L^{-1} and
U^{-1}. As demonstrated in Section 6, these approaches can
drastically reduce the memory needed to hold matrices L^{-1}
and U^{-1}; they have practically linear space complexity in
the number of edges in the given graph by using the adjacency-list
representation [6].

4 A doubly-bordered block diagonal matrix D is defined by D_uv =
0, ∀{(u, v) | P(u) ≠ P(v), P(u) ≠ k + 1, P(v) ≠ k + 1}, where P(u)
is the partition number of node u.

5 For all data in our experiments, Louvain Method can compute
partitions in a few seconds.






4.3 Tree estimation 

We introduce an algorithm for estimating the proximities
of unselected nodes in the search process effectively and efficiently.
In this approach, nodes are visited one by one, and
we estimate the proximity of each visited node. If the estimated
proximity is not lower than the K-th highest proximity of candidate nodes,
then the node is selected to compute the exact proximity.
Otherwise, we skip the subsequent exact proximity computations.
This approach, based on a single breadth-first
search tree, yields upper bounding estimations of the proximities
of visited nodes. In this section, we first give the notations
for the estimation, next we formally introduce the estimation,
and then describe our approach to incremental estimation in
the search process.

4.3.1 Notation 

In the search process, we construct a single breadth-first
search tree that is rooted at the query node; the query node thus forms
layer 0. The direct neighbors of the root node form layer 1.
All nodes that are i hops from the root node form layer i.

The set of nodes in the graph is defined as V, and the set of
selected (i.e., exact proximity computed) nodes is defined as
V_s. The layer number of node u is denoted as l_u. Moreover,
the set of nodes selected prior to node u whose layer number is
l_u is defined as V_{l_u}(u), that is V_{l_u}(u) = {v : (v ∈ V_s) ∩ (l_v =
l_u)}. A_max is the maximum element in matrix A, that is
A_max = max{A_ij : i, j ∈ V}. A_max(u) is the maximum
element from node u, that is A_max(u) = max{A_iu : i ∈ V}.

Note that both A_max and A_max(u) can be precomputed.
It requires O(1) space to hold A_max, and it requires O(n)
space to hold A_max(u) of all n nodes.
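
A small sketch of these quantities follows. The names bfs_layers and column_maxima and the sparse column format (cols[v] lists the pairs (u, A_uv) for edges from v to u) are our own assumptions for illustration.

```python
from collections import deque


def bfs_layers(cols, q):
    """Layer number l_u of every node reachable from q (q is layer 0)."""
    layer = {q: 0}
    queue = deque([q])
    while queue:
        v = queue.popleft()
        for u, _ in cols[v]:                # edges v -> u with non-zero A_uv
            if u not in layer:
                layer[u] = layer[v] + 1
                queue.append(u)
    return layer


def column_maxima(cols):
    """Return (A_max, list of A_max(u) for every node u)."""
    per_node = [max((a for _, a in col), default=0.0) for col in cols]
    return max(per_node), per_node


# Column normalized form of the undirected path 0 - 1 - 2.
cols = [[(1, 1.0)], [(0, 0.5), (2, 0.5)], [(1, 1.0)]]
layer = bfs_layers(cols, q=0)
a_max, a_max_of = column_maxima(cols)
```

Both results depend only on the graph (and, for the layers, on the query node), so the maxima can be computed once and reused for every query.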

4.3.2 Proximity estimation 

We describe the definition of the proximity estimation in
this section. We also show that the estimation gives a valid
upper proximity bound. We estimate the proximity of node
u via the breadth-first search tree as follows:

Definition 1 (Proximity estimation). If node u is
not the query node (i.e. u ≠ q), the following equation gives
the definition of the proximity estimation of node u, \bar{p}_u, to skip
proximity computation in the search process:

\bar{p}_u = c' \left\{ \sum_{v \in V_{l_u - 1}(u)} p_v A_{max}(v) + \sum_{v \in V_{l_u}(u)} p_v A_{max}(v) + \left( 1 - \sum_{v \in V_s} p_v \right) A_{max} \right\}    (8)

where c' = (1 - c)/(1 - A_uu + c A_uu).
If node u is the query node (i.e. u = q), \bar{p}_u = 1.

It needs O(n) time to compute the estimation for each
node if we compute it according to Definition 1. This is
because V_{l_u - 1}(u), V_{l_u}(u), and V_s all have size of O(n). We,
however, compute the estimation in O(1) time as described
in Section 4.3.3.

To show the property of our proximity estimation, we introduce
the following lemma:

Lemma 1 (Proximity estimation). \bar{p}_u ≥ p_u holds
for node u in the given graph.

This lemma enables us to find the answer nodes exactly. 
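
To make Definition 1 and Lemma 1 concrete, the sketch below evaluates Equation (8) directly, in O(n) time per node, on a three-node path. The helper names, the dense matrix format, and the fixed-iteration solver used to obtain the exact proximities are assumptions made for this illustration only.

```python
def rwr(A, q, c, iters=200):
    """Exact proximities by iterating Equation (1) on a dense matrix."""
    n = len(A)
    p = [1.0 if u == q else 0.0 for u in range(n)]
    for _ in range(iters):
        p = [(1.0 - c) * sum(A[u][v] * p[v] for v in range(n))
             + (c if u == q else 0.0) for u in range(n)]
    return p


def estimate(A, c, u, layer, selected):
    """Equation (8): upper bound on p_u for a node u other than q.

    selected maps each already selected node v to its exact proximity p_v.
    """
    n = len(A)
    col_max = [max(A[i][v] for i in range(n)) for v in range(n)]   # A_max(v)
    a_max = max(col_max)                                            # A_max
    prev_layer = sum(p * col_max[v] for v, p in selected.items()
                     if layer[v] == layer[u] - 1)
    same_layer = sum(p * col_max[v] for v, p in selected.items()
                     if layer[v] == layer[u])
    rest = (1.0 - sum(selected.values())) * a_max
    c_prime = (1.0 - c) / (1.0 - A[u][u] + c * A[u][u])
    return c_prime * (prev_layer + same_layer + rest)


# Undirected path 0 - 1 - 2 with query node 0.
c = 0.95
A = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
layer = {0: 0, 1: 1, 2: 2}
p = rwr(A, 0, c)

# Visit the nodes in layer order and record (estimate, exact) pairs.
selected = {0: p[0]}
bounds = []
for u in (1, 2):
    bounds.append((estimate(A, c, u, layer, selected), p[u]))
    selected[u] = p[u]
```

In this example every estimate is at least as large as the exact proximity, as Lemma 1 states.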



The determination of the root node of the tree and the 
selection of proximity computation nodes are important in 
achieving efficient pruning. We determine the query node 
as the root node, and we visit and select nodes in increas- 
ing order of layer number. This is because: (1) a few nodes 
which are just a few hops from the query node have high 
proximities, and (2) we can estimate the proximities of visited
nodes from those of selected nodes (see Definition 1).
As a result, we can effectively estimate the proximities of 
visited nodes. 

If the estimated proximity of a visited node is lower than
the K-th highest proximity of the candidate nodes, we terminate
the search process without computing the estimations
of other unvisited and unselected nodes. However, this raises
the following question: 'Can we find the answer nodes exactly
if we terminate the search process?'. To answer this
question, we show the following lemma:

Lemma 2 (Layer search). If nodes are visited and
selected in ascending order of layers, \bar{p}_u ≥ \bar{p}_v holds for nodes
u and v such that l_u ≤ l_v and u, v ≠ q.

Lemma 2 implies that the estimated proximity of a vis- 
ited node can not be lower than that of an unvisited and 
unselected node on the same/lower layer. Therefore, if the 
estimated proximity of a visited node is lower than the K-th 
highest proximity of the candidate nodes, all other unvisited 
and unselected nodes have lower proximities than the K-th 
highest proximities of the candidate nodes. Thus we can 
safely terminate the search process. 

4.3.3 Incremental computation 

As described in Section 4.3.2, by Definition 1, O(n) time is
required to compute the estimated proximity for each node.
In this section, we show our approach to efficiently compute
the estimated proximity. We assume that node u is visited
and selected immediately after node u' in the search process.
In other words, we visit and select these nodes in the order u'
and u. In this section, let \bar{p}_{u,1}, \bar{p}_{u,2}, and \bar{p}_{u,3} be the first, second,
and third terms in Equation (8), respectively. That is, \bar{p}_u =
c'(\bar{p}_{u,1} + \bar{p}_{u,2} + \bar{p}_{u,3}).

We compute the estimation of u as follows:

Definition 2 (Incremental update). For the given
graph and query node, if u' ≠ q, we compute the first, second,
and third terms of the estimation of node u from those
of u' in the search process as follows:

\bar{p}_{u,1} = \begin{cases} \bar{p}_{u',1} & \text{if } l_u = l_{u'} \\ \bar{p}_{u',2} + p_{u'} A_{max}(u') & \text{otherwise} \end{cases}

\bar{p}_{u,2} = \begin{cases} \bar{p}_{u',2} + p_{u'} A_{max}(u') & \text{if } l_u = l_{u'} \\ 0 & \text{otherwise} \end{cases}    (9)

\bar{p}_{u,3} = (\bar{p}_{u',3} / A_{max} - p_{u'}) A_{max}

If u' = q, \bar{p}_{u,1} = p_q A_max(q), \bar{p}_{u,2} = 0, and \bar{p}_{u,3} = (1 -
p_q) A_max.

We provide the following lemma for the incremental com- 
putation in the search process: 

Lemma 3 (Incremental update). For node u, the
estimated proximity can be exactly computed at the cost of
O(1) time by Definition 2.

This property enables K-dash to efficiently compute the 
estimated proximity in the search process. 
4.4 Search algorithm

Even though a detailed search algorithm of K-dash is de- 
scribed in Algorithm 4 in Appendix B.2, we outline our 
search algorithm as follows to make the paper self-contained: 

1. We construct a breadth-first search tree rooted at the
query node.

2. We visit a node in ascending order of tree layer and 
compute its estimated proximity by Definition 2. 

3. If the estimated proximity of the visited node is not 
lower than the K-th proximity of the candidate nodes, 
the node can be an answer node. Therefore, we com- 
pute the proximity of the node and return to step 2. 

4. Otherwise, we terminate the search process since the 
node and other unselected nodes can not be answer 
nodes (Lemma 2). 
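
The four steps above can be sketched as follows. This is an illustrative reading of the outline, not Algorithm 4 itself: the function names, the dense matrix input, and the `exact` callback are our assumptions. In the example the callback simply looks up proximities obtained by iterating Equation (1); in K-dash it would be the sparse-matrix computation of Section 4.2. The example graph has no self-loops, and K is assumed to be at least 1.

```python
from collections import deque
import heapq


def rwr(A, q, c, iters=200):
    """Stand-in for the exact proximity computation (iterates Equation (1))."""
    n = len(A)
    p = [1.0 if u == q else 0.0 for u in range(n)]
    for _ in range(iters):
        p = [(1.0 - c) * sum(A[u][v] * p[v] for v in range(n))
             + (c if u == q else 0.0) for u in range(n)]
    return p


def bfs_order(A, q):
    """Step 1: nodes reachable from q in breadth-first order, with layers."""
    layer, order, queue = {q: 0}, [q], deque([q])
    while queue:
        v = queue.popleft()
        for u in range(len(A)):
            if A[u][v] > 0 and u not in layer:
                layer[u] = layer[v] + 1
                order.append(u)
                queue.append(u)
    return order, layer


def k_dash_search(A, q, c, K, exact):
    """Return the K (proximity, node) pairs with the highest proximity."""
    n = len(A)
    col_max = [max(A[i][v] for i in range(n)) for v in range(n)]   # A_max(v)
    a_max = max(col_max)                                            # A_max
    order, layer = bfs_order(A, q)
    p_prev, prev = exact(q), q
    top = [(p_prev, q)]                     # min-heap of current candidates
    # Terms of Equation (8) for the first node after the root (u' = q).
    t1, t2, t3 = p_prev * col_max[q], 0.0, (1.0 - p_prev) * a_max
    for u in order[1:]:                     # step 2: ascending layer order
        if prev != q:                       # Definition 2, case u' != q
            carried = t2 + p_prev * col_max[prev]
            if layer[u] == layer[prev]:
                t2 = carried
            else:
                t1, t2 = carried, 0.0
            t3 = (t3 / a_max - p_prev) * a_max
        c_prime = (1.0 - c) / (1.0 - A[u][u] + c * A[u][u])
        bound = c_prime * (t1 + t2 + t3)
        if len(top) == K and bound < top[0][0]:
            break                           # step 4: terminate the search
        p_u = exact(u)                      # step 3: exact proximity
        if len(top) < K:
            heapq.heappush(top, (p_u, u))
        elif p_u > top[0][0]:
            heapq.heapreplace(top, (p_u, u))
        prev, p_prev = u, p_u
    return sorted(top, reverse=True)


# Two triangles joined by the edge (2, 3); query node 0, K = 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
deg = [0] * n
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
A = [[0.0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1.0 / deg[v]
    A[v][u] = 1.0 / deg[u]

exact_p = rwr(A, 0, 0.95)
calls = []                                  # records which nodes were selected


def exact(u):
    calls.append(u)
    return exact_p[u]


result = k_dash_search(A, 0, 0.95, 2, exact)
```

In this example the search stops at the first node of layer 2, so exact proximities are requested only for the query node and its two direct neighbors.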

5. THEORETICAL ANALYSIS 

This section provides a theoretical analysis that confirms 
the accuracy and complexity of K-dash. Let n be the num- 
ber of nodes. Proofs of each theorem in this section are 
shown in Appendix A. 

We show that K-dash finds the top-k highest proximity 
nodes accurately (without fail) as follows: 

Theorem 2 (Exact search). K-dash guarantees the 
exact answer in finding the top-k highest proximity nodes. 

We discuss the complexity of the existing approximate
algorithms B_LIN and NB_LIN [22] and then that of K-dash.

Theorem 3 (The approximate algorithm). B_LIN
and NB_LIN both require O(n^2) space and O(n^2) time to find
the top-k highest proximity nodes.

Theorem 4 (Space complexity of K-dash). K-dash
requires O(n^2) space to find the top-k highest proximity nodes.

Theorem 5 (Time complexity of K-dash). K-dash
requires O(n^2) time to find the top-k highest proximity nodes.

Theorems 3, 4, and 5 show that K-dash has, in the worst
case, the same space and time complexities as the previous
approximate approaches. However, the space and time
complexities of K-dash are, in practice, O(n + m), which are
smaller than those of the previous approximate approaches.
This is because the number of non-zero elements in the inverse
matrices is O(m) as shown in the next section. In the
next section, we confirm the effectiveness of our approaches 
by presenting the results of extensive experiments. 

6. EXPERIMENTAL EVALUATION 

We performed experiments to demonstrate K-dash's effectiveness
in a comparison to NB_LIN by Tong et al. [22]
and Basic Push Algorithm by Gupta et al. [7]. NB_LIN
was selected since, as reported in [22], it outperforms the
iterative approach and the approximation approach by Sun
et al. [19], and it yields similar results to B_LIN, which
was also proposed by Tong et al., in all of our datasets.
NB_LIN has one parameter: the target rank of the low-rank
approximation. We used SVD as the low-rank approximation
as proposed by Tong et al. Note that NB_LIN can compute
proximities quickly at the expense of exactness. Basic
Push Algorithm is an approach that can find top-k nodes
efficiently for Personalized PageRank. The definitions of
RWR and Personalized PageRank are very similar. Even
though Avrachenkov et al. also proposed an efficient approach
for Personalized PageRank based top-k search [2],
we compared K-dash to Basic Push Algorithm. This is because
Basic Push Algorithm uses precomputed proximities
of hub nodes to estimate the upper bounding proximities
[7]; this implies Basic Push Algorithm theoretically guarantees
that the recall of its answer result is always 1 while
the approach of Avrachenkov et al. does not. Basic Push
Algorithm is an approximate approach and the number of
answer nodes yielded by this approach can be more than K.
Our experiments will demonstrate that:
Our experiments will demonstrate that: 

• Efficiency: K-dash can outperform the approximate
approaches by several orders of magnitude in terms of
search time for the real datasets tested (Section 6.1).

• Exactness: Unlike the approximate approach, which 
sacrifices accuracy, K-dash can find the top-k nodes 
exactly (Section 6.2). 

• Effectiveness: The components of K-dash, sparse ma- 
trices computation and tree estimation, are very effec- 
tive in identifying top-k nodes (Section 6.3). 

The results of the application of K-dash to a real dataset 
are reported in Appendix D. 

We used the following five public datasets in the exper- 
iments: Dictionary, Internet, Citation, Social, and Email. 
The details of datasets are reported in Appendix C. In this 
section, K-dash represents the results of finding the top five 
nodes with the hybrid reordering approach. We set the 
restart parameter, c, at 0.95 as in the previous works [22, 
8]. We evaluated the search performance through wall clock 
time. All experiments were conducted on a Linux quad 3.33 
GHz Intel Xeon server with 32GB of main memory. We 
implemented our algorithms using GCC. 

6.1 Efficiency of K-dash 

We assessed the search time needed for K-dash, NB_LIN,
and Basic Push Algorithm. Figure 2 shows the results. The
results of K-dash are referred to as K-dash(K), where K
is the number of answer nodes. We set the target rank of
SVD to 100 and 1,000 (referred to as NB_LIN(100) and
NB_LIN(1,000)). Note that the number of answer nodes,
K, has no impact on NB_LIN since it computes approximate
proximity scores for all nodes. BPA(K) indicates the results
of Basic Push Algorithm where K is the number of answer
nodes and the number of hub nodes is set to 1,000.

This figure shows that our method is much faster than the
previous approaches under all conditions examined. Specifically,
K-dash is more than 68,000 times faster than NB_LIN and
690,000 times faster than Basic Push Algorithm. As described
in Section 5, NB_LIN takes $O(n^2)$ time to compute proximities.
Even though K-dash theoretically requires $O(n^2)$ time as shown
in Theorem 5, it does not, in practice, take $O(n^2)$ time to find
the answer nodes. This is because the number of non-zero elements
in the inverse matrices is $O(m)$ in practice as shown in
Section 6.3.1. That is, the time complexity of K-dash is, in
practice, $O(n + m)$, because it takes $O(n + m)$ time for
breadth-first tree construction and $O(m)$ time for proximity
computations. Therefore, our approach can find the answer nodes
more efficiently than the previous approaches.

In Personalized PageRank, a random particle returns to the start
node set, not the start node.

6.2 Exactness of the search results 

One major advantage of K-dash is that it guarantees the
exact answer, but this raises the following simple question:
'How successful are the approximate approaches in providing
the exact answer even though they sacrifice exactness?'.

To answer this question, we conducted comparative experiments.
We used precision as the metric of accuracy. Precision is the
fraction of answer nodes among the top-k results of each
approach that match those of the original iterative algorithm.
Figure 3 shows the precision and Figure 4 shows the wall clock
time. In this experiment, we used various target rank settings
and various numbers of hub nodes for NB_LIN and Basic Push
Algorithm, respectively. We used
Dictionary as the dataset in these experiments.
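
The precision metric described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments; the function name `precision_at_k` and the node identifiers are hypothetical.

```python
def precision_at_k(approx_top_k, exact_top_k):
    """Fraction of the approximate answer nodes that also appear in the exact top-k."""
    if not approx_top_k:
        return 0.0
    exact = set(exact_top_k)
    hits = sum(1 for node in approx_top_k if node in exact)
    return hits / len(approx_top_k)


# Hypothetical node ids, for illustration only.
exact = ["a", "b", "c", "d", "e"]    # top-5 of the iterative algorithm
approx = ["a", "b", "x", "d", "y"]   # top-5 of an approximate method
print(precision_at_k(approx, exact))  # 0.6
```

An exact method such as K-dash returns the same node set as the iterative algorithm, so its precision under this definition is 1.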

As we can see from Figure 3, the precision of K-dash is 1
because it finds the top-k nodes without fail. NB_LIN, on
the other hand, has lower precision. Figure 4 shows that
K-dash greatly reduces the computation time while it
guarantees the exact answer. The efficiency of NB_LIN depends
on the parameters used.

Figure 3 and Figure 4 also show that NB_LIN forces a
trade-off between speed and accuracy. That is, as the target
rank decreases, the wall clock time decreases but so does the
precision. NB_LIN does not guarantee that the answer
results are accurate, and so can miss the exact top-k nodes.
K-dash also computes estimated proximities, but unlike the
approximate approach, K-dash does not discard any answer
nodes in the search process.

Figure 3 shows that the precision of Basic Push Algo- 
rithm is almost constant against the number of hub nodes. 
Figure 4 indicates that the search speed of the approach 
increases as the number of hub nodes increases. This is 
because Basic Push Algorithm utilizes precomputed prox- 
imities of hub nodes to estimate the proximities of a query 
node. Figures 3 and 4 show that our approach is much faster 
than the previous approaches while guaranteeing exactness. 

6.3 Effectiveness of each approach 

In the following experiments, we examine the effective- 
ness of the two core techniques of K-dash: sparse matrices 
computation and tree estimation. 

6.3.1 Reordering approach 

K-dash utilizes the inverse matrices of lower/upper
triangulars obtained by LU decomposition to compute the
proximities of selected nodes in the search process. The number
of non-zero elements in these inverse matrices is a factor
that influences the search and memory cost. We take three
approaches to reduce the number of non-zero elements as
described in Section 4.2.2. Accordingly, we evaluated the
ratio of the number of non-zero elements to that of edges
in each reordering approach. Figure 5 shows the results.
In this figure, Random represents the results achieved when
nodes are arranged in random order.

[Figure axis labels: Target rank of SVD / Number of hub nodes; Dictionary, Internet, Citation, Social, Email]

As we can see from the figure, our approaches (Degree,
Cluster, and Hybrid reordering) yield many fewer non-zero
elements than random reordering. This figure also indicates
that our approach brings the number of non-zero elements
close to the number of edges of the given graph in all datasets
if we adopt the Hybrid reordering approach. That is, the space
complexity of K-dash is, in practice, $O(m)$. Owing to the small
size of the inverse matrices, K-dash achieves excellent search
performance as shown in Figure 2.

6.3.2 Precomputation time 

Our approach uses the inverse matrices of lower/upper 
triangulars in the search process. That is, these matrices 
must be computed in the precomputing process. Figure 6 
shows the process time in the precomputing process. 

Figure 6 indicates that our reordering approach can reduce
the processing time; it is up to 140 times faster than the
Random reordering approach. There are two reasons for this
result. The first is that the inverse matrices have a sparse
data structure as shown in Figure 5. The second is that the
elements of the inverse matrices that correspond to
cross-partition edges between the 1st to $\kappa$-th partitions are
zero due to Equations (4), (5), (6), and (7) 7 . Therefore we can
effectively skip the computations of these elements. As a result,
we can efficiently compute the inverse matrices. Additional
experiments confirmed that, owing to its sparse data structure,
our approach needs less precomputation time than the other
approaches. For example, NB_LIN needs several weeks to compute
SVD because SVD requires $O(n^3)$ time, while our approach needs
several hours.

6.3.3 Tree estimation 

As mentioned in Section 4.3, K-dash skips unnecessary
proximity computations in the search process. To show the
effectiveness of this idea, we removed the pruning technique
from K-dash, and reexamined the wall clock time. Figure 7
shows the result. K-dash without the pruning technique is
abbreviated to Without pruning in this figure.

7 For Dictionary, the improvement yielded by our approach was
marginal. This is because the Louvain Method divides this dataset
into one large partition and many small partitions, which limits
the effectiveness of our approach.

[Figure 3: Precision. Figure 4: Wall clock time. Figure 5: Effect of
reordering approaches. Figure 6: Precomputation time. Figure 7:
Effect of tree estimation.]

The results show that the pruning technique can provide 
efficient search for all datasets used; these results indicate 
that our approach is effective for various edge distributions. 
K-dash is up to 1,020 times faster if the pruning method 
is used. This is because we can effectively terminate the 
search process with the estimation technique. These results 
(compare Figure 2 to Figure 7) also show that, by , K-dash 
can find the top-k nodes faster than NB_LIN even if K-dash 
computes the proximities of all nodes. To evaluate the effec- 
tiveness of our approach for various proximity distributions, 
we subjected it to additional evaluations using various val- 
ues of the restart probability c. The results confirmed that 
our approach can efficiently find the top-k nodes under all 
conditions examined; we can effectively prune unnecessary 
proximity computations. 

7. CONCLUSIONS 

This paper addressed the problem of finding the top-k
nodes for a given node efficiently. As far as we know, this
is the first study to address the top-k node search problem
with the guarantee of exactness. Our proposal, K-dash, is
based on two ideas: (1) It computes the proximities of
selected nodes efficiently by use of inverse matrices, and (2)
It skips unnecessary proximity computations in finding the
top-k nodes, which greatly improves overall efficiency. Our
experiments show that K-dash works as expected; it can
find the top-k nodes at high speed. Specifically, it is
significantly faster than existing approximate methods. Top-k
search based on RWR is fundamental for many applications
in various domains such as image captioning, recommender
systems, and link prediction. The proposed solution allows
the top-k nodes to be detected exactly and efficiently, and so
will help to improve the effectiveness of future applications.
APPENDIX
A. PROOFS 

In this section, we show the proofs of all lemmas and the- 
orems in this paper. 

A.1 Theorem 1

Proof. We prove the theorem by a reduction from the 
elimination ordering problem [23]. An instance of the 
elimination ordering problem consists of a graph, node 
elimination ordering, and the chordal graph that can be ob- 
tained by the graph and the elimination ordering. Given 
the graph, the elimination ordering problem finds the 
minimum number of edges whose addition makes the graph 
chordal by changing node ordering. 

We transform an instance of the elimination ordering
problem to an instance of the inverse matrices problem as
follows: for the graph of the elimination ordering problem,
we create matrix $A$. That is, we make the adjacency-list
matrix from the graph. For the node elimination ordering,
we create the node ordering, and we create matrices $L^{-1}$ and
$U^{-1}$ for the chordal graph.

Given this mapping, it is easy to show that there exists
a solution to the elimination ordering problem with the
minimum number of edge additions if and only if there exists
a solution to the inverse matrices problem with the minimum
increase of non-zero elements in the inverse matrices.
The inverse matrices problem is trivially in NP. □

A.2 Lemma 1 

Proof. If node $u$ is not the query node, the following
equation holds from Equation (1):

$$p_u = (1 - c)(A_{u1} p_1 + A_{u2} p_2 + \ldots + A_{uu} p_u + \ldots + A_{un} p_n)$$

Since more than two upper/lower layer nodes can not be
directly connected to node $u$ in the breadth-first search tree,
if the set of directly neighboring nodes (adjacent nodes) of
node $u$ is $N_u$, $p_u$ is represented as follows:

$$p_u = c' \sum_{v \neq u} A_{uv} p_v = c' \sum_{v \in N_u} A_{uv} p_v \leq c' \left\{ \sum_{v \in \{V_{l_u - 1}(u) + V_{l_u}(u)\}} A_{uv} p_v + \sum_{v \in V \setminus V_s} A_{uv} p_v \right\}$$

Since $p_v$ is a probability, $\sum_{v \in V \setminus V_s} p_v = 1 - \sum_{v \in V_s} p_v$ holds.
Therefore,

$$p_u \leq c' \left\{ \sum_{v \in V_{l_u - 1}(u)} p_v A_{max}(v) + \sum_{v \in V_{l_u}(u)} p_v A_{max}(v) + \left(1 - \sum_{v \in V_s} p_v\right) A_{max} \right\} = \bar{p}_u$$

If node $u$ is the query node, it is obvious that $\bar{p}_u \geq p_u$ since
$\bar{p}_u = 1$ and $0 \leq p_u \leq 1$. Thus the inequality holds. □
Example. Let node $u_1$ be a query node in the directed
graph in Figure 8. As described in Section 4.3, node/layer
numbers of all the nodes are assigned by the breadth-first search
tree; node $u_1$ forms layer 0, nodes $u_2$ and $u_3$ form layer 1,
nodes $u_4$ and $u_5$ form layer 2, and nodes $u_6$ and $u_7$ form layer
3. We assume that we visit and select nodes in ascending
order of their node number. For node $u_5$, the following
equation holds from Equation (1) since $A_{51}, A_{53}, A_{57} = 0$:

$$p_{u_5} = c'(A_{51} p_1 + A_{52} p_2 + A_{53} p_3 + A_{54} p_4 + A_{56} p_6 + A_{57} p_7) = c'(A_{52} p_2 + A_{54} p_4 + A_{56} p_6)$$

Since the proximities of nodes $u_1$, $u_2$, $u_3$ and $u_4$ are already
computed before nodes $u_5$ and $u_6$, the following inequality
holds:

$$p_{u_5} \leq c'(p_2 A_{max}(u_2) + p_4 A_{max}(u_4) + (1 - p_1 - p_2 - p_3 - p_4) A_{max}) = \bar{p}_{u_5}$$

Note that our estimation approach takes into account edges
of selected nodes and unvisited nodes as $A_{max}(u)$ and $A_{max}$,
respectively. For example, non-tree edges $A_{54}$ and $A_{56}$ are
taken as $A_{max}(u_4)$ and $A_{max}$, respectively.

Figure 8: An example graph.
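
The bound in this example can be evaluated numerically. The sketch below is illustrative only: the function name `estimate` and all numeric values are hypothetical, and the expression used for c' is our reading of the rearrangement in the proof of Lemma 1 (moving the $A_{uu} p_u$ term to the left-hand side), not a formula quoted from this section.

```python
def estimate(c, a_uu, layer_terms, selected_mass, a_max):
    """Upper bound on p_u in the form used in the proof of Lemma 1.

    layer_terms   -- (p_v, A_max(v)) pairs of the selected nodes that enter the two layer sums
    selected_mass -- sum of p_v over all selected nodes V_s
    a_max         -- largest element of A
    """
    # Assumption: c' = (1 - c) / (1 - A_uu + c * A_uu), which reduces to 1 - c without self-loops.
    c_prime = (1.0 - c) / (1.0 - a_uu + c * a_uu)
    known = sum(p_v * a_max_v for p_v, a_max_v in layer_terms)
    return c_prime * (known + (1.0 - selected_mass) * a_max)


# Hypothetical proximities of the selected nodes u_1..u_4 and hypothetical column maxima.
c = 0.95
p = {1: 0.96, 2: 0.015, 3: 0.01, 4: 0.005}
col_max = {2: 0.5, 4: 1.0}

bound_u5 = estimate(
    c,
    a_uu=0.0,
    layer_terms=[(p[2], col_max[2]), (p[4], col_max[4])],  # u_2 and u_4, as in the example
    selected_mass=sum(p.values()),
    a_max=1.0,
)
print(bound_u5)  # roughly 0.001125
```

With these made-up numbers the unvisited nodes share at most 0.01 of the probability mass, so the bound is small even though $A_{max}$ is taken as 1.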

A.3 Lemma 2 

Proof. If $l_u = l_v$, it is obvious that $\bar{p}_u = \bar{p}_v$. If $l_u = l_v - 1$,
the following inequality holds since $V_{l_v}(v) = \emptyset$:

$$\bar{p}_v = c' \left\{ \sum_{w \in V_{l_u}(u)} p_w A_{max}(w) + \left(1 - \sum_{w \in V_s} p_w\right) A_{max} \right\} \leq \bar{p}_u$$

And if $l_u \leq l_v - 2$, the following inequality similarly holds
since $V_{l_v}(v) = \emptyset$ and $V_{l_v - 1}(v) = \emptyset$:

$$\bar{p}_v = c' \left(1 - \sum_{w \in V_s} p_w\right) A_{max} \leq \bar{p}_u$$

which completes the proof. □

A.4 Lemma 3 

Proof. We first prove that, if $u' = q$, $\bar{p}_u$ can be exactly
computed from $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$. In this case,
$V_{l_u - 1}(u) = q$, $V_{l_u}(u) = \emptyset$, and $V_s = q$. Therefore, it is
obvious that $c'(\bar{p}_{u,1} + \bar{p}_{u,2} + \bar{p}_{u,3}) = \bar{p}_u$ holds by Definition 2.

We next prove that, if $u' \neq q$, the estimated proximity of
node $u$ can be exactly computed from that of node $u'$.

If $l_u = l_{u'}$, $V_{l_u - 1}(u) = V_{l_{u'} - 1}(u')$ and $V_{l_u}(u) = V_{l_{u'}}(u') + u'$.
Therefore,

$$\bar{p}_{u,1} - \bar{p}_{u',1} = \sum_{v \in \{V_{l_u - 1}(u) - V_{l_{u'} - 1}(u')\}} p_v A_{max}(v) = 0$$

and

$$\bar{p}_{u,2} - \bar{p}_{u',2} = \sum_{v \in \{V_{l_u}(u) - V_{l_{u'}}(u')\}} p_v A_{max}(v) = p_{u'} A_{max}(u')$$

Otherwise (i.e. $l_u = l_{u'} + 1$), $V_{l_u - 1}(u) = V_{l_{u'}}(u') + u'$ and
$V_{l_u}(u) = \emptyset$. Therefore,

$$\bar{p}_{u,1} = \sum_{v \in \{V_{l_{u'}}(u') + u'\}} p_v A_{max}(v) = \bar{p}_{u',2} + p_{u'} A_{max}(u')$$

and

$$\bar{p}_{u,2} = \sum_{v \in \emptyset} p_v A_{max}(v) = 0$$

Since node $u$ is visited immediately after node $u'$,

$$(\bar{p}_{u',3} - \bar{p}_{u,3}) / A_{max} = p_{u'}$$

Therefore, $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$ can be exactly computed
from $\bar{p}_{u',1}$, $\bar{p}_{u',2}$, and $\bar{p}_{u',3}$, respectively.

We finally prove that it takes $O(1)$ time to compute $\bar{p}_{u,1}$,
$\bar{p}_{u,2}$, and $\bar{p}_{u,3}$ in the search process. If $u' = q$, $\bar{p}_{u,1}$, $\bar{p}_{u,2}$,
and $\bar{p}_{u,3}$ are defined by Definition 2. As described in
Section 4.3.1, both $A_{max}$ and $A_{max}(u)$ can be precomputed.
$\bar{p}_{u',1}$, $\bar{p}_{u',2}$, $\bar{p}_{u',3}$, and $p_{u'}$ are already computed before
computing $\bar{p}_u$ in the search process if $u' \neq q$. This completes
the proof. □
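
The constant-time update in this proof can be checked against a direct recomputation. The following is a minimal sketch under stated assumptions: `update_terms` and `direct_terms` are hypothetical names, the three terms are taken to be the two layer sums and the remaining-mass term that appear in the proof, and the layer numbers, proximities, and column maxima are made-up values.

```python
def update_terms(prev_terms, p_prev, a_max_prev, same_layer, a_max):
    """O(1) update of (term1, term2, term3) when moving from node u' to the next node u.

    prev_terms -- the three terms of u'
    p_prev     -- exact proximity of u';  a_max_prev -- A_max(u')
    same_layer -- True if l_u = l_u', False if l_u = l_u' + 1
    """
    t1, t2, t3 = prev_terms
    if same_layer:
        t2 = t2 + p_prev * a_max_prev          # u' joins the same-layer sum
    else:
        t1, t2 = t2 + p_prev * a_max_prev, 0.0  # old same-layer sum becomes the upper-layer sum
    return t1, t2, t3 - p_prev * a_max


def direct_terms(i, layers, p, col_max, a_max):
    """Recompute the three terms of the i-th visited node from scratch."""
    selected = range(i)  # nodes visited before node i
    t1 = sum(p[v] * col_max[v] for v in selected if layers[v] == layers[i] - 1)
    t2 = sum(p[v] * col_max[v] for v in selected if layers[v] == layers[i])
    t3 = (1.0 - sum(p[v] for v in selected)) * a_max
    return t1, t2, t3


# Hypothetical visit order: index 0 is the query node; all values are made up.
layers = [0, 1, 1, 2, 2, 3]
p = [0.95, 0.02, 0.015, 0.006, 0.005, 0.002]
col_max = [0.5, 0.5, 1.0, 0.25, 0.5, 1.0]
a_max = 1.0

terms = direct_terms(1, layers, p, col_max, a_max)  # case u' = q
max_gap = 0.0
for i in range(2, len(layers)):
    terms = update_terms(terms, p[i - 1], col_max[i - 1], layers[i] == layers[i - 1], a_max)
    expected = direct_terms(i, layers, p, col_max, a_max)
    max_gap = max(max_gap, *(abs(a - b) for a, b in zip(terms, expected)))
print(max_gap < 1e-12)  # True
```

Each update touches a fixed number of scalars, which is the property the lemma relies on.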

A.5 Theorem 2 

Proof. Let $\theta$ be the K-th highest proximity among the
candidate nodes in the search process. And let $\theta_K$ be the
K-th highest proximity among the answer nodes (i.e. $\theta_K$ is
the lowest proximity among the answer nodes).

We first prove that $\theta$ is monotonic non-decreasing in the
search process of K-dash. To find the answer nodes in the
search process, we first set $\theta$ at 0 and set the dummy nodes
as the candidates. We maintain the candidates as the best
result so far; when we find a node whose proximity is greater
than $\theta$, a candidate is replaced by the new node (see
Algorithm 4). This makes $\theta$ higher or leaves it unchanged.
Therefore, $\theta$ never decreases in the search process.

In the search process, since $\theta$ is monotonic non-decreasing
and $\theta_K \geq \theta$, the estimated proximities of answer nodes are
never lower than $\theta$ (Lemma 1). The algorithm discards a
node if (1) its estimated proximity is lower than $\theta$, or (2) its
upper/same layer unselected node has estimated proximity
lower than $\theta$. Since the estimated proximity of a node can
not be lower than that of a node on the same or lower layer
(Lemma 2), the answer nodes can never be pruned during
the search process. □

A.6 Theorem 3 

Proof. We first prove that B_LIN and NB_LIN [22] both
need $O(n^2)$ space. The off-line process of B_LIN first
partitions the adjacency matrix by METIS [9], and then
decomposes the matrix into the within-partition edge matrix
and the cross-partition edge matrix. It next performs low-rank
approximation for the cross-partition edge matrix and
obtains two orthogonal matrices and one diagonal matrix.
It then computes the product of the within-partition edge
matrix and the orthogonal matrices.

The off-line process of NB_LIN first performs low-rank
approximation for the adjacency matrix and obtains two
orthogonal matrices and one diagonal matrix. It then
computes the product of these matrices.

Both B_LIN and NB_LIN hold the matrix product and
two orthogonal matrices to compute the proximities. The
matrix product and orthogonal matrices have size of $O(n^2)$.
Therefore, B_LIN and NB_LIN both require $O(n^2)$ space.

Next, we prove that B_LIN and NB_LIN both need $O(n^2)$
time. They compute the proximities of nodes by multiplying
the vector $q$, the matrix product, and orthogonal matrices.
Even though the size of the vector $q$ is $O(n)$, that of the
matrix product and orthogonal matrices is $O(n^2)$. Therefore,
B_LIN and NB_LIN both require $O(n^2)$ time. □



Algorithm 1 Degree reordering
Input: $A$, the column-normalized adjacency matrix
Output: $A'$, the reordered matrix of $A$
1: arrange nodes in ascending order of their degrees;
2: compute matrix $A'$ by interchanging the rows and columns of
   matrix $A$ by the degree order;
3: return $A'$;



Algorithm 2 Cluster reordering
Input: $A$, the column-normalized adjacency matrix
Output: $A'$, the reordered matrix of $A$
1: divide nodes into $\kappa$ partitions $P_1, P_2, \ldots, P_\kappa$ by Louvain method;
2: create new partition $P_{\kappa+1} = \emptyset$;
3: for $i$ := 1 to $\kappa$ do
4:   remove nodes whose edges cross more than two partitions from
     partition $P_i$;
5:   append the removed nodes to $P_{\kappa+1}$;
6: end for
7: compute matrix $A'$ by interchanging the rows and columns of
   matrix $A$ by the partitions;
8: return $A'$;



A.7 Theorem 4 

Proof. To compute the estimation, K-dash holds the
maximum elements of the matrix $A$, the previous estimated
proximity, and the previous proximity. It needs $O(n)$ space
to hold these values. K-dash keeps the inverse matrices to
compute the proximities. The number of non-zero elements
of these matrices is $O(n^2)$ in the worst case, so it
requires $O(n^2)$ space to keep the inverse matrices.
Therefore, our approach requires $O(n^2)$ space to find the top-k
highest proximity nodes. □

A.8 Theorem 5

Proof. To find the answer nodes, K-dash first constructs
the breadth-first search tree, and computes the estimated
proximity of the visited node. It next computes the proximity
of the node if the node is not pruned by the estimation.
K-dash needs $O(n + m)$ time to construct the tree and $O(n)$
time to compute the estimations if it can not prune any nodes
by the estimation. This is because it takes $O(1)$ time to
compute the estimation for each node (Lemma 3). K-dash needs
$O(n^2)$ time to compute the proximities of all nodes since the
inverse matrices have size of $O(n^2)$ in the worst case. So
K-dash needs $O(n^2)$ time to find the top-k highest proximity
nodes. □

B. ALGORITHMS 

In this section, we show the algorithms for reordering ap- 
proaches and K-dash. 

B.1 Reordering approach

We interchange the rows and columns of matrix A to re- 
duce the number of non-zero elements in the inverse matri- 
ces. Since the inverse matrices problem is NP-complete, 
we take three approximation solutions to the problem: de- 
gree reordering, cluster reordering, and hybrid reordering. 

Algorithm 1 depicts our degree reordering approach. This 
approach reduces non-zero elements by arranging low de- 
gree nodes to the upper/left elements in matrix A. It first 
computes the degrees of all nodes and arranges the nodes 
according to their degrees (line 1). It then computes the 
reordered matrix by the degree order (line 2). 
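
The effect of degree reordering can be illustrated on a toy star graph. The following is a minimal sketch, not the authors' implementation: it assumes an unweighted graph stored as a dictionary of neighbour sets, assumes that the matrix being decomposed is W = I - (1 - c)A, uses a dense pure-Python LU decomposition without pivoting, and counts the non-zero elements of the two inverse factors. All function names are hypothetical.

```python
def degree_order(adj):
    """Nodes in ascending order of degree (Algorithm 1, line 1)."""
    return sorted(adj, key=lambda u: (len(adj[u]), u))


def rwr_matrix(adj, order, c):
    """W = I - (1 - c) A, with A column-normalized and rows/columns arranged by `order`."""
    pos = {u: i for i, u in enumerate(order)}
    n = len(order)
    w = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v, nbrs in adj.items():
        for u in nbrs:  # edge v -> u carries weight 1 / outdeg(v)
            w[pos[u]][pos[v]] -= (1.0 - c) / len(nbrs)
    return w


def lu(w):
    """Doolittle LU decomposition without pivoting (W is diagonally dominant)."""
    n = len(w)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [row[:] for row in w]
    for k in range(n):
        for i in range(k + 1, n):
            m = upper[i][k] / upper[k][k]
            lower[i][k] = m
            upper[i][k] = 0.0
            for j in range(k + 1, n):
                upper[i][j] -= m * upper[k][j]
    return lower, upper


def inverse(t):
    """Gauss-Jordan inverse; adequate for triangular factors with a non-zero diagonal."""
    n = len(t)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(t)]
    for k in range(n):
        pivot = a[k][k]
        a[k] = [x / pivot for x in a[k]]
        for i in range(n):
            if i != k and a[i][k] != 0.0:
                f = a[i][k]
                a[i] = [x - f * y for x, y in zip(a[i], a[k])]
    return [row[n:] for row in a]


def inverse_nnz(adj, order, c=0.95, tol=1e-12):
    """Number of non-zero elements in the inverses of the L and U factors."""
    lower, upper = lu(rwr_matrix(adj, order, c))
    return sum(abs(x) > tol for m in (inverse(lower), inverse(upper)) for row in m for x in row)


# Star graph: node 0 is the hub, nodes 1..5 are leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
hub_first = [0, 1, 2, 3, 4, 5]
by_degree = degree_order(star)  # the hub has the highest degree and goes last
print(by_degree, inverse_nnz(star, hub_first), inverse_nnz(star, by_degree))
```

Placing the hub last keeps the fill-in confined to the last row and column, whereas eliminating the hub first couples all leaves with one another, which is the intuition behind arranging low-degree nodes first.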






Algorithm 3 Hybrid reordering
Input: $A$, the column-normalized adjacency matrix
Output: $A'$, the reordered matrix of $A$
1: divide nodes into $\kappa + 1$ partitions $P_1, P_2, \ldots, P_{\kappa+1}$ by cluster
   reordering;
2: for $i$ := 1 to $\kappa + 1$ do
3:   arrange nodes in the $i$-th partition in ascending order of their
     degrees;
4: end for
5: compute matrix $A'$ by interchanging the rows and columns of
   matrix $A$ by the partitions and the degree order;
6: return $A'$;



Algorithm 4 K-dash
Input: $q$, the query node
       $K$, the number of answer nodes
       $L^{-1}$, the inverse matrix of $L$
       $U^{-1}$, the inverse matrix of $U$
Output: $V_a$, the set of answer nodes
1: $\theta = 0$;
2: $V_s = \emptyset$;
3: $V_a = \emptyset$;
4: append $K$ dummy nodes to $V_a$;
5: compute the breadth-first search tree of node $q$;
6: while $V_s \neq V$ do
7:   $u$ := argmin($l_v \mid v \in V \setminus V_s$);
8:   compute the estimated proximity of node $u$, $\bar{p}_u$;
9:   if $\bar{p}_u < \theta$ then
10:    return $V_a$;
11:  else
12:    compute the proximity, $p_u$, by $L^{-1}$ and $U^{-1}$;
13:    if $p_u > \theta$ then
14:      $v$ := argmin($p_w \mid w \in V_a$);
15:      remove node $v$ from $V_a$;
16:      append node $u$ to $V_a$;
17:      $\theta$ := min($p_w \mid w \in V_a$);
18:    end if
19:  end if
20:  append node $u$ to $V_s$;
21: end while
22: return $V_a$;



We show our cluster reordering approach in Algorithm 2.
It reduces non-zero elements in the inverse matrices by
transferring nodes whose edges cross partitions into the
($\kappa + 1$)-th partition. It first partitions the graph into $\kappa$
partitions by Louvain method (line 1). It checks each node as to
whether the node has any cross-partition edges. If the node has
cross-partition edges, it transfers the node to the ($\kappa + 1$)-th
partition (lines 3-6). It finally computes the reordered matrix by the
partitions (line 7).
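
The node-transfer step described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the partition labels are assumed to be given (for instance by a Louvain implementation, which is not reproduced here), the graph is assumed to be undirected and stored as neighbour sets, and `cluster_reorder` is a hypothetical name.

```python
def cluster_reorder(adj, part):
    """Return the partitions P_1..P_kappa followed by the border partition P_{kappa+1}.

    adj  -- dict: node -> set of neighbours (undirected, so it covers in- and out-edges)
    part -- dict: node -> partition id in 0..kappa-1 (assumed to come from Louvain)
    """
    kappa = max(part.values()) + 1
    groups = [[] for _ in range(kappa + 1)]  # the last group collects border nodes
    for u in sorted(adj):
        crosses = any(part[v] != part[u] for v in adj[u])
        groups[kappa if crosses else part[u]].append(u)
    return groups


# Two triangles joined by the single cross-partition edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

groups = cluster_reorder(adj, part)
order = [u for group in groups for u in group]
print(groups)  # [[0, 1], [4, 5], [2, 3]]
print(order)   # [0, 1, 4, 5, 2, 3]
```

After the transfer, the first two groups have no edges between them, so the corresponding off-diagonal blocks of the reordered matrix are zero.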

Algorithm 3 shows our hybrid reordering approach. It
combines the above two approaches. It first obtains the
partitions by the cluster reordering approach (line
1). It then arranges nodes in each partition by their degrees
(lines 2-4). It finally computes the reordered matrix by the
partitions and the degree order (line 5).
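
A minimal sketch of the hybrid ordering follows. It assumes that the partitions (including the border partition as the last group) have already been produced by cluster reordering; the function name `hybrid_order` and the example graph are hypothetical.

```python
def hybrid_order(adj, groups):
    """Concatenate the partitions, sorting the nodes of each one by ascending degree."""
    order = []
    for group in groups:
        order.extend(sorted(group, key=lambda u: (len(adj[u]), u)))
    return order


# Hypothetical undirected graph; the last group holds the border nodes 5 and 6.
adj = {0: {1, 2, 5}, 1: {0}, 2: {0, 5}, 3: {4, 6}, 4: {3, 6}, 5: {0, 2, 6}, 6: {3, 4, 5}}
groups = [[0, 1, 2], [3, 4], [5, 6]]

print(hybrid_order(adj, groups))  # [1, 2, 0, 3, 4, 5, 6]
```

Node identifiers are used only to break ties between nodes of equal degree, which is a choice made for this sketch.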

B.2 Search algorithm 

Our main approach to finding the answer nodes is to com- 
pute the proximities of selected nodes by the inverse matri- 
ces, and to use the estimated proximities to skip unnecessary 
proximity computations. 

Algorithm 4 shows the search algorithm that efficiently
finds the K highest proximity nodes for the query node. In this
algorithm, $\theta$ and $V_a$ indicate the K-th highest proximity
among the candidate nodes and the set of candidate/answer
nodes, respectively.

In the search process, K-dash first sets the candidate nodes
by appending K dummy nodes whose proximities are all 0
(line 4). It then constructs the breadth-first search tree
(line 5). K-dash then visits nodes one by one according to the
tree layer (line 7), and computes the estimated proximity of
each node (line 8). If the estimated proximity of the visited
node is lower than $\theta$, the node cannot be an answer node
(Lemma 1), and the proximities of all other unselected nodes
cannot be higher than $\theta$ (Lemma 2). Therefore it terminates
the search process (lines 9-10). Otherwise, the visited node may
be an answer node. Thus it computes the proximity of the node
(line 12). If the computed proximity is higher than $\theta$, it
updates the candidate set, $V_a$, and $\theta$ (lines 13-18). It
returns the candidate set, $V_a$, as the answer node set (line 22).

As shown in Algorithm 4, this algorithm automatically
terminates the process if the estimated proximity is lower
than $\theta$. That is, this algorithm does not require any
user-defined inner-parameters.
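
The control flow of Algorithm 4 can be sketched in a few dozen lines. The code below is a simplified illustration, not the authors' implementation, and it rests on several assumptions that are ours: exact proximities come from a plain power iteration instead of the inverse matrices $L^{-1}$ and $U^{-1}$, edge weights are uniform so that $A_{max}(v)$ is 1/outdeg(v), the graph has no self-loops so that c' is taken as 1 - c, and the layer sums are recomputed at each step rather than updated in constant time. All names are hypothetical.

```python
from collections import deque


def rwr_proximities(adj, q, c, iters=100):
    """Iterate p = (1 - c) A p + c e_q (stand-in for the inverse-matrix computation)."""
    p = {u: 0.0 for u in adj}
    p[q] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for v, nbrs in adj.items():
            for u in nbrs:  # edge v -> u, weight 1 / outdeg(v)
                nxt[u] += (1.0 - c) * p[v] / len(nbrs)
        nxt[q] += c
        p = nxt
    return p


def bfs_order(adj, q):
    """Nodes reachable from q in breadth-first order, with their layer numbers."""
    layer, order, queue = {q: 0}, [], deque([q])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v]):
            if u not in layer:
                layer[u] = layer[v] + 1
                queue.append(u)
    return order, layer


def k_dash_sketch(adj, q, k, c=0.95):
    exact = rwr_proximities(adj, q, c)
    col_max = {v: 1.0 / len(adj[v]) for v in adj}  # A_max(v) for uniform edge weights
    a_max = max(col_max.values())
    order, layer = bfs_order(adj, q)
    answers = [(0.0, None)] * k                    # K dummy nodes with proximity 0
    theta, mass, selected, computed = 0.0, 0.0, [], 0
    for u in order:
        if u == q:
            bound = 1.0
        else:
            known = sum(exact[v] * col_max[v] for v in selected if layer[v] >= layer[u] - 1)
            bound = (1.0 - c) * (known + (1.0 - mass) * a_max)  # c' = 1 - c without self-loops
        if bound < theta:
            break                                  # no unvisited node can exceed theta
        p_u = exact[u]                             # the exact proximity is "computed" here
        computed += 1
        if p_u > theta:
            answers.remove(min(answers, key=lambda t: t[0]))
            answers.append((p_u, u))
            theta = min(t[0] for t in answers)
        selected.append(u)
        mass += p_u
    ranked = sorted((t for t in answers if t[1] is not None), key=lambda t: -t[0])
    return [u for _, u in ranked], computed


# Hypothetical undirected graph given as neighbour sets; node 0 is the query node.
graph = {0: {1, 2}, 1: {0, 3}, 2: {0, 3, 4}, 3: {1, 2, 5},
         4: {2, 5}, 5: {3, 4, 6}, 6: {5, 7}, 7: {6}}
top, computed = k_dash_sketch(graph, q=0, k=3)
print(top, computed)
```

On this small graph the search stops as soon as the estimate of the first layer-2 node falls below $\theta$, so only part of the graph is examined while the returned nodes coincide with those of an exhaustive ranking.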

C. EXPERIMENTAL DATASETS 

We used the following five public datasets: 

• Dictionary 8 : This dataset was taken from the word
network in FOLDOC 9 . FOLDOC is a well-known on-line
dictionary of computing subjects. An edge from node u to
node v exists in the graph if and only if, in the FOLDOC
dictionary, term v is used to describe the meaning of
term u. The numbers of nodes and edges are 13,356
and 120,238, respectively.

• Internet 10 : We used a snapshot of the structure of
the Internet at the level of autonomous systems. This
graph was constructed from BGP tables posted by the
University of Oregon Route Views Project 11 . The Oregon
Route Views Project allows Internet users to view
global BGP routing information from the perspective
of other locations around the Internet. The numbers of
nodes and edges are 22,963 and 48,436, respectively.

• Citation 12 : This graph is a weighted network of
co-authorships between scientists posting preprints on the
Condensed Matter E-Print Archive 13 . The Condensed
Matter E-Print Archive is the fully automated e-print
archive for preprints in condensed matter, a specialized
field of physics. The numbers of nodes and edges
are 31,163 and 120,029, respectively.

• Social 14 : This graph is taken from Epinions.com 15 .
Epinions.com is a general consumer review site. Members
of the site can decide whether to trust each other.
This dataset is a who-trusts-whom online social network
which has 131,828 nodes and 841,372 edges.

• Email 16 : The graph was generated using email data
from a large European research institution. In this
graph, each node corresponds to an email address, and
a directed edge between nodes u and v indicates that the
user of address u sent at least one message to address v.
This dataset has 265,214 nodes and 420,045 edges.

8 http://vlado.fmf.uni-lj.si/pub/networks/data/dic/foldoc/foldoc.zip
9 http://foldoc.org/
10 http://www-personal.umich.edu/~mejn/netdata/as-22july06.zip
11 http://routeviews.org/
12 http://www-personal.umich.edu/~mejn/netdata/cond-mat-2003.zip
13 http://arxiv.org/archive/cond-mat
14 http://snap.stanford.edu/data/soc-sign-epinions.html
15 http://www.epinions.com/
16 http://snap.stanford.edu/data/email-EuAll.html
Table 2: Ranked lists by K-dash and NB_LIN for company and operating system names.

| Term | Method | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|---|
| Microsoft | K-dash | Microsoft | MS-DOS | IBM PC | Microsoft Windows | Microsoft Corporation |
| Microsoft | NB_LIN | Microsoft | microsecond | CS-Prolog | MICRO SAINT | Microsoft Basic |
| APPLE | K-dash | APPLE | Apple Attachment Unit Interface | Apple II | Apple Computer, Inc. | APPC |
| APPLE | NB_LIN | APPLE | APIC | Personalized Array Translator | I-APL | CEEMAC+ |
| Microsoft Windows | K-dash | Microsoft Windows | W2K | Windows/386 | Windows 3.0 | Windows 3.11 |
| Microsoft Windows | NB_LIN | Microsoft Windows | Microsoft Networking | Microsoft Network | W2K | Thumb |
| Mac OS | K-dash | Mac OS | Macintosh user interface | Macintosh file system | multitasking | Macintosh Operating System |
| Mac OS | NB_LIN | Mac OS | Rhapsody | SORCERER | Macintosh Operating System | PowerOpen Association |
| Linux | K-dash | Linux | Linux Documentation Project | Unix | lint | Linux Network Administrators' Guide |
| Linux | NB_LIN | Linux | Linux Documentation Project | SL5 | debianize | SLANG |
[Figure 9 plot: series K-dash and Random; datasets Dictionary, Internet, Citation, Social, Email]

Figure 9: Comparison of root node selection.

D. ADDITIONAL EXPERIMENTS 

In this section, we show the results of an additional
experiment on root node selection and of case studies on two
company names and three operating system names.

D.1 Root node selection

Our estimate algorithm sets the query node as the root 
node to find the top-k nodes efficiently. To show the effec- 
tiveness of this idea, we show the number of proximity com- 
putations in Figure 9. In this figure, Random represents the 
case of selecting the root node at random. 

Our root node selection method requires fewer proximity
computations than the random method. In the search process,
we first compute the breadth-first search tree from the
root node. As the query node and its neighboring nodes
have high proximity, we can obtain high proximity nodes
early with this approach. As a result, we can more effectively
estimate the proximities of unselected nodes.

D.2 Case-studies 

In this section, we show the results of experimental case studies
on two company names and three operating system names
to show the effectiveness of K-dash. In this experiment,
we identified the high proximity terms for 'Microsoft',
'APPLE', 'Microsoft Windows', 'Mac OS', and 'Linux'. We set
the target rank of SVD to 1,000 for NB_LIN. Table 2 shows
the results. We omitted the results of Basic Push Algorithm
due to space limitations.

The results of our method for the two company names
make sense while those of the approximate method do not.
For example, K-dash successfully detected the formal names
of the two companies, 'Microsoft Corporation' for 'Microsoft'
and 'Apple Computer, Inc.' for 'APPLE'. K-dash detected
'IBM PC' as the third-relevant term for 'Microsoft'. This
result may seem strange; however, it is very reasonable if we
consider the close relation of the companies. Microsoft was
founded in 1975 by Bill Gates. In 1980, IBM chose Microsoft
to supply the operating system for the IBM PC. As a result,
Microsoft eventually became the leading vendor. For Apple,
K-dash finds 'Apple II', an 8-bit PC of the company, as the
third-relevant term. Apple II was invented by
Steve Wozniak, who co-founded Apple with Steve Jobs,
and was very popular from about 1980 until the first several
years of MS-DOS. However, the approximate approach has
difficulty in obtaining these intuitive results.

The results of K-dash for the three operating system names 
reveal that these operating systems have distinctive char- 
acteristics. The results of 'Microsoft Windows' are 'Mi- 
crosoft Windows', 'W2K', 'Windows/386', 'Windows 3.0', 
and 'Windows 3.11'. All are Microsoft operating systems. 
These results reflect Microsoft's dominant market position. 
On the other hand, the results of 'Mac OS' include several 
of Apple's unique technical terms. For example, K-dash de- 
tects 'Macintosh user interface' as the second-relevant term 
for 'Mac OS'. Macintosh user interface is the graphical user 
interface used by Apple Computer's Macintosh family of 
PCs. The original Macintosh was the first successful PC 
to use a graphical user interface devoid of a command line. 
'Macintosh file system', the third-relevant term for 'Mac 
OS', is Apple's disk file system adopted only in Mac OS. 
These results reflect Apple's culture of creativity. The re- 
sults of 'Linux' imply its open culture. Unlike Microsoft 
Windows, the development of Linux is an example of free 
and open source software collaboration. Therefore, there 
are many projects that support the development of Linux. 
Linux Documentation Project is an all-volunteer project that
maintains a large collection of GNU and Linux-related
documentation and publishes the collection on-line.

In conclusion, the results of K-dash are strongly indicative
of the characteristics of each company and operating system.
The approximate approach is ineffective in finding such highly
relevant terms for the two company names and the three
operating system names.






